House prices report

This document is a data science report on the Kaggle house prices tutorial project. It was generated using the Shapash library.

General Information

Version : 0.7

Name : House Prices Prediction Project

Purpose : Predicting the sale price of houses

Date : 2021-10-26

Contributors : Yann Golhen, Sebastien Bidault, Thomas Bouche, Guillaume Vignal, Thibaud Real

Description : This data science project aims to predict the sale price of houses from 79 explanatory variables. It was designed by the data science team at X. and has been improved since the beginning of the project in 2019. The model has been in production since February 2021.

Git Commit : 1ff46e83beafba8949a7f3b7de27586acd6ae99e


Dataset Information

Origin : The Assessor’s Office

Description : Sales of individual residential properties in Ames, Iowa

Depth : from 2006 to 2010

Perimeter : only residential sales

Target Variable : SalePrice

Target Description : The property's sale price in dollars


Data Preparation

Variable Filtering : All variables that required special knowledge or previous calculations for their use were removed

Individual Filtering : only the most recent sales data on any property were kept (for houses that were sold multiple times during this period)
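The individual filtering rule above can be sketched as follows. This is a minimal illustration using only the standard library. The record layout, the property identifiers and the values are hypothetical, since the report does not say how repeated sales were identified in the source data.

```python
from datetime import date

# Hypothetical sale records: (property_id, sale_date, sale_price).
# Identifiers and values are illustrative, not taken from the dataset.
sales = [
    ("P001", date(2006, 5, 1), 150000),
    ("P002", date(2007, 3, 15), 210000),
    ("P001", date(2009, 8, 20), 172000),
    ("P003", date(2008, 11, 2), 98000),
    ("P002", date(2006, 1, 10), 195000),
]


def keep_most_recent(records):
    """Return one record per property: the one with the latest sale date."""
    latest = {}
    for record in records:
        property_id, sale_date, _ = record
        if property_id not in latest or sale_date > latest[property_id][1]:
            latest[property_id] = record
    # Sort by property id so the output order is deterministic.
    return sorted(latest.values())


filtered = keep_most_recent(sales)
for property_id, sale_date, price in filtered:
    print(property_id, sale_date.isoformat(), price)
```

Properties sold only once pass through unchanged. For properties sold several times, only the latest sale is kept.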

Missing Values : were replaced by 0

Feature Engineering : No feature was created. All features are taken directly from the Kaggle dataset. Categorical features were transformed using an ordinal encoder.
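To make the two preparation steps concrete, here is a small sketch of zero-filling missing values and of ordinal encoding, written in plain Python. It does not reproduce the encoder actually used in the project. The helper names, the sample values and the choice to number categories from 1 in order of first appearance are all assumptions made for illustration.

```python
# Illustrative rows; the values are made up for this sketch.
rows = [
    {"MSZoning": "RL", "LotFrontage": 65.0},
    {"MSZoning": "RM", "LotFrontage": None},
    {"MSZoning": "RL", "LotFrontage": 80.0},
]


def fill_missing(rows, fill_value=0):
    """Replace missing values (None) with fill_value in every column."""
    return [
        {key: (fill_value if value is None else value) for key, value in row.items()}
        for row in rows
    ]


def fit_ordinal_mapping(rows, column):
    """Assign integer codes 1..n to categories in order of first appearance."""
    mapping = {}
    for row in rows:
        category = row[column]
        if category not in mapping:
            mapping[category] = len(mapping) + 1
    return mapping


def apply_ordinal_mapping(rows, column, mapping, unknown=-1):
    """Replace each category by its code; unseen categories get `unknown`."""
    return [{**row, column: mapping.get(row[column], unknown)} for row in rows]


prepared = fill_missing(rows)
zoning_codes = fit_ordinal_mapping(prepared, "MSZoning")
encoded = apply_ordinal_mapping(prepared, "MSZoning", zoning_codes)
print(zoning_codes)
print(encoded)
```

The mapping should be fitted on the training data only and then reused on new data, which is why fitting and applying are separate functions here.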


Model Training

Used Algorithm : We used a RandomForestRegressor (scikit-learn). This model could be challenged by other candidates such as XGBRegressor or neural networks.

Parameters Choice : We did not perform any hyperparameter optimisation and chose n_estimators=50. Future work should include a grid search over the hyperparameters.

Metrics : Mean Squared Error

Validation Strategy : We split our data into train (75%) and test (25%) sets
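The training setup described above can be sketched with scikit-learn as follows. This is a hedged illustration, not the project's training script. The data is synthetic, with 1,460 rows so that the split sizes match the 1,095 and 365 observations reported later in this document. The fixed random seeds are an assumption added for reproducibility. The project itself used 72 features and left random_state unset.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the prepared dataset (assumption: 5 numeric features).
rng = np.random.default_rng(0)
X = rng.normal(size=(1460, 5))
y = 180_000 + 40_000 * X[:, 0] + rng.normal(scale=10_000, size=1460)

# 75% train / 25% test split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Random forest with 50 trees and otherwise default hyperparameters.
model = RandomForestRegressor(n_estimators=50, random_state=0)
model.fit(X_train, y_train)

# Evaluate on the held-out test set with the mean squared error.
y_pred = model.predict(X_test)
test_mse = mean_squared_error(y_test, y_pred)
print(f"train rows: {len(X_train)}, test rows: {len(X_test)}, MSE: {test_mse:,.0f}")
```

A single hold-out split is quick to run but gives a noisier performance estimate than cross-validation, which would be a natural companion to the grid search mentioned above.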


Model analysis

Model used : RandomForestRegressor

Library : sklearn.ensemble._forest

Library version : 0.24.1

Model parameters :

Parameter key Parameter value
base_estimator DecisionTreeRegressor()
n_estimators 50
estimator_params ('criterion', 'max_depth', 'min_samples_split', 'min_samples_leaf', 'min_weight_fraction_leaf', 'max_features', 'max_leaf_nodes', 'min_impurity_decrease', 'min_impurity_split', 'random_state', 'ccp_alpha')
bootstrap True
oob_score False
n_jobs None
random_state None
verbose 0
warm_start False
class_weight None
max_samples None
criterion mse
max_depth None
min_samples_split 2
min_samples_leaf 1
min_weight_fraction_leaf 0.0
max_features auto
max_leaf_nodes None
min_impurity_decrease 0.0
min_impurity_split None
ccp_alpha 0.0
n_features_in_ 72
n_features_ 72
n_outputs_ 1
base_estimator_ DecisionTreeRegressor()
estimators_ [DecisionTreeRegressor(max_features='auto', random_state=662305423), DecisionTreeRegressor(max_features='auto', random_state=661015781), DecisionTreeRegressor(max_features='auto', random_state=1578391283), DecisionTreeRegressor(max_features='auto', random_state=1906048284),...

Dataset analysis

Global analysis

                        Training dataset   Prediction dataset
number of features      72                 72
number of observations  1,095              365
missing values          0                  0
% missing values        0                  0

Univariate analysis

1stFlrSF - Numeric

First Floor square feet
        Training dataset   Prediction dataset
count   1,095              365
mean    1,180              1,120
std     400                341
min     334                483
25%     886                864
50%     1,100              1,050
75%     1,420              1,320
max     4,690              2,630

Target analysis

SalePrice - Numeric

        Training dataset   Prediction dataset
count   1,095              365
mean    182,000            177,000
std     78,500             82,000
min     34,900             40,000
25%     130,000            126,000
50%     165,000            160,000
75%     215,000            205,000
max     755,000            745,000

Multivariate analysis


Model explainability

Note : the explainability graphs were generated using the test set only.

Global feature importance plot

Features contribution plots

1stFlrSF -

First Floor square feet

Model performance

Univariate analysis of target variable

SalePrice - Numeric

        True values   Prediction values
count   365           365
mean    177,000       177,000
std     82,000        70,500
min     40,000        66,100
25%     126,000       128,000
50%     160,000       157,000
75%     205,000       200,000
max     745,000       524,000

Metrics

Mean absolute error : 16,100

Mean squared error : 626,000,000
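For readers who want to relate the two metrics, the sketch below shows how each is computed and converts the reported mean squared error into a root mean squared error, which is expressed in dollars like the mean absolute error. The four true/predicted pairs are made-up values for illustration. Only the figure 626,000,000 comes from this report, and it is itself rounded.

```python
import math


def mean_absolute_error(y_true, y_pred):
    """Average of the absolute differences between true and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between true and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Illustrative values only (not taken from the test set).
y_true = [200000, 150000, 320000, 95000]
y_pred = [190000, 165000, 300000, 100000]

mae = mean_absolute_error(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)
rmse = math.sqrt(mse)
print(f"MAE: {mae:,.0f}  MSE: {mse:,.0f}  RMSE: {rmse:,.0f}")

# Root of the mean squared error reported above, in dollars.
reported_rmse = math.sqrt(626_000_000)
print(f"Reported RMSE: {reported_rmse:,.0f}")
```

The root of the reported mean squared error is about 25,000 dollars, noticeably larger than the mean absolute error of 16,100. Squaring gives more weight to large errors, so this gap suggests that a few sales are predicted much less accurately than the rest.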